This is a simple example analysis of data including import from Excel, data structuring and plotting. The data in this case happens to be optical density data over time (replicate growth curves for a microorganism) but the nature of the data matters little to the basics introduced.
library(readxl) # fast excel reader
#library(googlesheets) # fast google spreadsheet reader (not used here but could be useful)
data.raw <- read_excel("example.xlsx", skip = 1)
## Warning in strptime(x, format, tz = tz): unknown timezone 'zone/tz/2018c.
## 1.0/zoneinfo/America/Los_Angeles'
library(knitr) # the package that renders R markdown and has some good additional functionality
kable(data.raw)
| Time | sample1 | sample2 | sample3 | sample4 | sample5 | sample6 | sample7 | sample8 | sample9 | sample10 | sample11 | sample12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-10-05 10:00:00 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 2015-10-05 12:25:00 | 0.004 | 0.002 | 0.002 | 0.002 | 0.002 | 0.001 | 0.004 | 0.003 | 0.003 | 0.005 | 0.002 | 0.003 |
| 2015-10-05 14:10:00 | 0.003 | 0.005 | 0.002 | 0.004 | 0.003 | 0.002 | 0.003 | 0.003 | 0.005 | 0.003 | 0.004 | 0.005 |
| 2015-10-05 16:27:00 | 0.003 | 0.007 | 0.012 | 0.006 | 0.008 | 0.006 | 0.006 | 0.007 | 0.007 | 0.006 | 0.004 | 0.007 |
| 2015-10-05 18:55:00 | 0.014 | 0.014 | 0.015 | 0.020 | 0.018 | 0.017 | 0.020 | 0.015 | 0.016 | 0.010 | 0.015 | 0.016 |
| 2015-10-05 20:30:00 | 0.030 | 0.031 | 0.030 | 0.034 | 0.026 | 0.027 | 0.027 | 0.022 | 0.020 | 0.022 | 0.024 | 0.025 |
| 2015-10-06 03:10:00 | 0.034 | 0.036 | 0.036 | 0.064 | 0.059 | 0.057 | 0.104 | 0.107 | 0.123 | 0.136 | 0.158 | 0.147 |
| 2015-10-06 09:20:00 | 0.029 | 0.033 | 0.031 | 0.053 | 0.051 | 0.051 | 0.083 | 0.093 | 0.109 | 0.144 | 0.133 | 0.139 |
| 2015-10-06 11:10:00 | 0.029 | 0.029 | 0.032 | 0.050 | 0.050 | 0.048 | 0.081 | 0.090 | 0.105 | 0.141 | 0.131 | 0.133 |
| 2015-10-07 11:00:00 | 0.026 | 0.033 | 0.035 | 0.051 | 0.049 | 0.049 | 0.076 | 0.078 | 0.085 | 0.125 | 0.131 | 0.116 |
Turning the wide format excel data into long format. Note: here we make use of the pipe operator %>%, which just simplifies chaining operations.
library(tidyr) # for restructuring data very easily
data.long <- data.raw %>% gather(sample, OD600, -Time)
# melt <- gather(raw, sample, OD600, -Time) # this would be identical without using %>%
Introducing time in hours.
library(dplyr, warn.conflicts = FALSE) # powerful for doing calculations on data (by group, etc.)
data.long <- data.long %>% mutate(time.hrs = as.numeric(Time - Time[1], units = "hours"))
First plot of all the data
library(ggplot2) # powerful plotting package for aesthetics driven plotting
p1 <-
ggplot(data.long) + # initiate plot
aes(x = time.hrs, y = OD600, color = sample) + # setup aesthetic mappings
geom_point(size = 5) # add points to plot
print(p1) # output plot
data.info <- read_excel("example.xlsx", sheet = "info")
Show all information (these are the experimental conditions for each sample)
kable(data.info)
| sample | substrate |
|---|---|
| sample1 | background H2 |
| sample2 | background H2 |
| sample3 | background H2 |
| sample4 | +0.5mM Ac |
| sample5 | +0.5mM Ac |
| sample6 | +0.5mM Ac |
| sample7 | +1.25mM Ac |
| sample8 | +1.25mM Ac |
| sample9 | +1.25mM Ac |
| sample10 | +2mM Ac |
| sample11 | +2mM Ac |
| sample12 | +2mM Ac |
Combine OD data with sample information.
data.all <- merge(data.long, data.info, by = "sample")
Reuse same plot using %+% to substitute the original data set with a new one and changing the color to be determined based on the new information we added (but keep everything else about the plot the same).
p1 %+% data.all %+% aes(color = substrate)
To make the figure a little bit easier to navigate, we’re going to summarize the data for each condition (combine the replicates) and replot it with an error band showing the whole range of data points for each condition. We could reuse the plot p1 again, but for clarity are constructing the plot from scratch instead.
data.sum <- data.all %>%
group_by(time.hrs, substrate) %>%
summarize(
OD600.avg = mean(OD600),
OD600.min = min(OD600),
OD600.max = max(OD600))
data.sum %>% head() %>% kable() # show the first couple of lines
| time.hrs | substrate | OD600.avg | OD600.min | OD600.max |
|---|---|---|---|---|
| 0.000000 | +0.5mM Ac | 0.0010000 | 0.001 | 0.001 |
| 0.000000 | +1.25mM Ac | 0.0010000 | 0.001 | 0.001 |
| 0.000000 | +2mM Ac | 0.0010000 | 0.001 | 0.001 |
| 0.000000 | background H2 | 0.0010000 | 0.001 | 0.001 |
| 2.416667 | +0.5mM Ac | 0.0016667 | 0.001 | 0.002 |
| 2.416667 | +1.25mM Ac | 0.0033333 | 0.003 | 0.004 |
p2 <- ggplot(data.sum) + # initiate plot
aes(x = time.hrs, y = OD600.avg, ymin = OD600.min, ymax = OD600.max,
fill = substrate) + # setup global aesthetic mappings
geom_ribbon(alpha = 0.3) + # value range (uses ymin and ymax, and fill for color)
geom_line() + # connect averages (uses y)
geom_point(shape = 21, size = 5) + # add points for averages (uses y and fill for color)
theme_bw() + # style plot
labs(title = "My plot", x = "Time [h]", y = "OD600", color = "Condition") # add labels
print(p2)
Note that we could also have had ggplot do the whole statistical summarising for us using stat_summary but it’s often helpful to have these values separately for other calcluations and purposes.
Now could e.g. focus on a subset of data but reuse same plot using %+% to substitute the original data set with a new one (but keep everythign else about the plot the same).
p2 %+% filter(data.sum, !grepl("background", substrate), time.hrs < 25)
Save this plot automatically as pdf by setting specific plot options in the r code chunk
print(p2)
Last, you can make simple interactive (javascript) plots out of your original ggplots (plotly does not yet work great for all ggplot features but it’s a start for easy visualization). You can of course construct plotly plots without ggplot for more customization too but that’s for another time.
library(plotly, warn.conflicts = FALSE)
ggplotly(p1)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`